Smoking is one of the leading causes of preventable deaths globally, contributing significantly to a wide range of health problems, including cardiovascular diseases, respiratory conditions, and various cancers. Over the years, many countries have implemented measures to reduce smoking prevalence, such as public health campaigns, tobacco taxes, smoking bans in public areas, and regulations on tobacco advertising. Despite these efforts, smoking continues to affect millions of people worldwide, with its prevalence varying significantly across regions and populations. By investigating the prevalence of smoking among adults globally from 2000 to 2020, this study aims to explore and identify patterns or disparities in smoking behaviours. Understanding these trends is crucial for informing future public health strategies and reducing the global burden of tobacco-related diseases.
How does the prevalence of smoking in adults across the world vary in the dataset, and what trends or patterns emerge from this analysis?
The project repository includes:
- Codebook: Detailed variable descriptions.
- Data Folder: Contains the raw and processed datasets.
- Figures Folder: Visualizations created during analysis.
- Scripts Folder: R scripts for cleaning, analysis, and plotting.
The PSY6422_smoke repository is organized into key sections to help you navigate its contents. The /codebook folder provides detailed documentation on the dataset, including variable descriptions and structure, offering essential context for the analysis. The /data folder contains the raw datasets used in this project, forming the basis of all analyses. The /figures folder showcases visualizations and plots created during the analysis, highlighting the project’s key findings and insights. Lastly, the /scripts folder includes all the code used for data processing, analysis, and visualization. Together, these sections guide you through the project workflow, from raw data to final outputs.
The raw dataset for this visualization project is the “Prevalence of current tobacco use (% of adults)” dataset, compiled from multiple sources by the World Bank (2024) and processed by Our World in Data. The original data come from the World Health Organization (via World Bank), “World Development Indicators”.
This data package contains the data that powers the chart “Share of adults who smoke” on the Our World in Data website. The World Development Indicators dataset is listed at https://datacatalog.worldbank.org/search/dataset/0037712/World-Development-Indicators
Prevalence is defined as the percentage of the population aged 15 years and over who currently use any tobacco product (smoked and/or smokeless tobacco) on a daily or non-daily basis. Tobacco products include cigarettes, pipes, cigars, cigarillos, waterpipes (hookah, shisha), bidis, kretek, heated tobacco products, and all forms of smokeless (oral and nasal) tobacco. Tobacco products exclude e-cigarettes (which do not contain tobacco), “e-cigars”, “e-hookahs”, JUUL and “e-pipes”. The rates are age-standardized to the WHO Standard Population.
The following consideration is important when interpreting the project’s results: estimates for countries with irregular surveys or many data gaps have large uncertainty ranges and should be interpreted with caution.
# List of packages to install and load
# (tidyverse already attaches ggplot2, tidyr and dplyr; they are listed explicitly for clarity)
packages <- c("tidyverse", "ggplot2", "tidyr", "dplyr", "plotly",
              "rnaturalearth", "rnaturalearthdata", "sf", "htmlwidgets")
# Function to install any missing packages and load them
install_and_load <- function(packages) {
  for (package in packages) {
    # require() loads the package and returns FALSE if it is not installed
    if (!require(package, character.only = TRUE)) {
      install.packages(package, dependencies = TRUE)
      library(package, character.only = TRUE)
    }
  }
}
# Run the function
install_and_load(packages)
# Load raw data / replace "data/smoking.csv" with the path to your CSV file.
rawdata <- read.csv("data/smoking.csv")
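Before going further, it can help to confirm that the file loaded with the columns the rest of the script relies on. The sketch below is a minimal check and not part of the original analysis: the helper name check_columns() is hypothetical, and a small in-memory data frame with made-up values stands in for data/smoking.csv so the example runs on its own.

```r
# Hypothetical helper: stop early if any expected column is missing
check_columns <- function(df, expected) {
  missing_cols <- setdiff(expected, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

# Column names as they appear after read.csv() in this project
expected_cols <- c("Entity", "Code", "Year",
                   "Prevalence.of.current.tobacco.use....of.adults.")

# Toy stand-in for rawdata (values are made up for illustration)
toy_raw <- data.frame(
  Entity = c("A", "B"),
  Code = c("AAA", "BBB"),
  Year = c(2000, 2000),
  Prevalence.of.current.tobacco.use....of.adults. = c(10.5, 20.1),
  stringsAsFactors = FALSE
)

# Passes silently; would stop with an error if a column were absent
check_columns(toy_raw, expected_cols)
```

In the project itself the same call would be check_columns(rawdata, expected_cols), placed directly after read.csv().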
Summary and structure of the dataset before cleaning.
# Sanity check
str(rawdata)              # Inspect structure
summary(rawdata)          # Summary statistics
head(rawdata)             # Check the first few rows
dim(rawdata)              # Check dimensions
colSums(is.na(rawdata))   # Check for missing values
# Check for duplicates
duplicates <- rawdata[duplicated(rawdata), ]
print(duplicates)
# Unique values in key columns
unique(rawdata$Entity)
unique(rawdata$Year)
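Note that duplicated() applied to the whole data frame only flags rows that are identical in every column. Two rows for the same country and year with different prevalence values would not be caught. A minimal sketch of a key-based check, using made-up values and illustrative variable names:

```r
# Toy data: the second and third rows share Entity and Year but differ in value
toy <- data.frame(
  Entity = c("A", "A", "A", "B"),
  Year = c(2000, 2005, 2005, 2000),
  Prevalence = c(10, 12, 13, 20),
  stringsAsFactors = FALSE
)

# Whole-row check: finds nothing, because no two rows are fully identical
full_row_dups <- toy[duplicated(toy), ]

# Key-based check: flags every row whose Entity-Year pair occurs more than once
key <- toy[, c("Entity", "Year")]
key_dups <- toy[duplicated(key) | duplicated(key, fromLast = TRUE), ]

nrow(full_row_dups)
nrow(key_dups)
```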
The dataset initially included various entities, such as global regions and income levels. To focus on country-level analysis, these entities were excluded. Additionally, the data was streamlined to 5-year intervals for clarity, which involved removing the years 2018 and 2019. For simplicity and readability, the variable representing prevalence was also renamed.
# Cleaning the data
# Remove specific entities
countries_data <- rawdata[!rawdata$Entity %in% c("East Asia and Pacific (WB)", "Sub-Saharan Africa (WB)",
                                                 "Upper-middle-income countries", "Europe and Central Asia (WB)",
                                                 "World", "European Union (27)", "Low-income countries",
                                                 "Lower-middle-income countries", "Middle East and North Africa (WB)",
                                                 "Middle-income countries", "North America (WB)", "South Asia (WB)",
                                                 "Latin America and Caribbean (WB)", "High-income countries"), ]
# Exclude specific years
countries_data <- countries_data %>%
  filter(!(Year %in% c(2018, 2019)))
# Rename column for ease of use
countries_data <- countries_data %>%
  rename(Prevalence = Prevalence.of.current.tobacco.use....of.adults.)
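Because the exclusion list is matched by exact name, a misspelled entry would silently leave an aggregate in the data. The sketch below, run on made-up rows, shows one way to confirm that every name in the list actually occurs in the Entity column. The variable names aggregates, unmatched and kept are illustrative and do not appear in the project scripts.

```r
# Toy stand-in for rawdata
toy_raw <- data.frame(
  Entity = c("World", "High-income countries", "A", "B"),
  Year = c(2000, 2000, 2000, 2000),
  stringsAsFactors = FALSE
)

# Exclusion list with one deliberate typo ("Wolrd")
aggregates <- c("Wolrd", "High-income countries")

# Names in the list that never appear in the data
unmatched <- setdiff(aggregates, unique(toy_raw$Entity))

# Rows kept after the filter: "World" survives because of the typo
kept <- toy_raw[!toy_raw$Entity %in% aggregates, ]

unmatched
kept$Entity
```

An empty unmatched vector indicates that every listed aggregate was found and removed.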
In this section, the world map data was loaded using the ne_countries() function, which provides a geospatial dataset containing geographic and country-level information. To ensure compatibility with the smoking data, discrepancies in ISO country codes between the world map dataset (world) and the smoking prevalence dataset (countries_data) were identified. Specifically, any codes present in countries_data but missing in world, and vice versa, were listed. To address these discrepancies, the function fix_iso_codes() was created to assign the correct ISO codes to France and Norway.
# Load world map data
world <- ne_countries(scale = "medium", returnclass = "sf")
# Identify codes in countries_data but not in world
missing_in_world <- setdiff(countries_data$Code, world$iso_a3)
print(missing_in_world)
# Identify codes in world but not in countries_data
missing_in_data <- setdiff(world$iso_a3, countries_data$Code)
print(missing_in_data)
# Fix missing ISO codes
fix_iso_codes <- function(world_data) {
  world_data %>%
    mutate(iso_a3 = ifelse(name == "France", "FRA", iso_a3)) %>%
    mutate(iso_a3 = ifelse(name == "Norway", "NOR", iso_a3))
}
world <- fix_iso_codes(world)
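The same correction can also be expressed as a lookup table, which may scale more easily if further mismatches turn up. This is a sketch on a plain data frame, not the project's code: the lookup vector iso_fixes is hypothetical, and the "-99" placeholder is an assumption about how missing codes appear in the map data.

```r
# Toy stand-in for the map attributes; "-99" is assumed to mark a missing code
world_toy <- data.frame(
  name = c("France", "Norway", "Spain"),
  iso_a3 = c("-99", "-99", "ESP"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: country name -> corrected ISO code
iso_fixes <- c(France = "FRA", Norway = "NOR")

# Overwrite iso_a3 only for the names listed in the lookup
to_fix <- world_toy$name %in% names(iso_fixes)
world_toy$iso_a3[to_fix] <- unname(iso_fixes[world_toy$name[to_fix]])

world_toy$iso_a3
```

Re-running setdiff() on the corrected codes afterwards is a quick way to confirm that the intended mismatches are gone.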
To prepare the datasets for visualization, the world map data and the cleaned smoking prevalence data were merged using the left_join() function. The datasets were aligned by matching the ISO country codes (iso_a3 in the world dataset and Code in the smoking dataset). During this process, any missing prevalence values were replaced with 0 using the mutate() function, ensuring that all countries had defined values for visualization, even if data was unavailable. Duplicate entries were identified and removed to maintain data integrity.
To make the data compatible with interactive visualizations in Plotly, the geometric information in the sf object was removed using the st_set_geometry(NULL) function. A detailed inspection of the merged dataset was performed. Finally, distributions of data across years and countries were tabulated to confirm consistency and readiness for visualization. This step ensured a clean, structured dataset suitable for creating accurate and informative visualizations.
# Merge datasets, sanity check/clean merged data
# Merge datasets
map_data <- world %>%
  left_join(countries_data, by = c("iso_a3" = "Code"))
# Replace NA prevalence values with 0
map_data <- map_data %>%
  mutate(Prevalence = ifelse(is.na(Prevalence), 0, Prevalence))
# Remove duplicates
map_data <- map_data[!duplicated(map_data), ]
# Remove geometry for Plotly compatibility
plot_data <- map_data %>%
  st_set_geometry(NULL)
# Structure and summary of the cleaned dataset
str(plot_data)
summary(plot_data)
# Check for duplicates in country-year pairs
duplicates <- plot_data %>%
  group_by(iso_a3, Year) %>%
  filter(n() > 1)
print(duplicates)
# Check for missing prevalence values
missing_prevalence <- plot_data %>%
  filter(is.na(Prevalence))
print(missing_prevalence)
# Distribution of years and countries
table(plot_data$Year)
table(plot_data$iso_a3)
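Replacing missing prevalence with 0 gives every country a value, but it also makes a country with no data indistinguishable from one with a reported prevalence of 0%, and it lowers any averages computed later. The sketch below illustrates the effect with made-up numbers and base R's merge() in place of left_join(). Keeping NA is shown as one possible alternative, not as the approach used in this project.

```r
# Toy map attributes and toy prevalence data (country CCC has no survey data)
world_toy <- data.frame(iso_a3 = c("AAA", "BBB", "CCC"), stringsAsFactors = FALSE)
smoke_toy <- data.frame(
  Code = c("AAA", "BBB"),
  Year = c(2020, 2020),
  Prevalence = c(10, 30),
  stringsAsFactors = FALSE
)

# Left join: keep every map row, matching iso_a3 to Code
merged <- merge(world_toy, smoke_toy, by.x = "iso_a3", by.y = "Code", all.x = TRUE)

# Option 1 (as in this project): missing values become 0
with_zero <- ifelse(is.na(merged$Prevalence), 0, merged$Prevalence)

# Option 2 (alternative): keep NA and skip it explicitly in summaries
mean(with_zero)                        # average pulled down by the filled-in 0
mean(merged$Prevalence, na.rm = TRUE)  # average of reported values only
```

If NA values are retained, the na.value argument of scale_fill_gradient() controls how countries without data are drawn on the map.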
The cleaned dataset contains country-level smoking prevalence among adults from 2000 to 2020 at 5-year intervals. Key variables are Entity, the name of each country; Code, the ISO-3 country code used to align the datasets; Year, the year of observation; and Prevalence, the percentage of the adult population currently using tobacco products. Regional and global aggregates have been excluded so that the data reflect country-level trends. Missing prevalence values have been replaced with 0 and duplicate entries removed, giving a structured dataset for visualizing smoking trends over time.
The initial visualization provided a basic representation of smoking prevalence on a static map. While it successfully displayed the prevalence rates for countries, it lacked interactivity and failed to account for changes across years. Users could not explore how smoking prevalence evolved over time or access specific data points for each country. The map presented a single snapshot without the ability to delve into detailed information, such as individual country names or exact prevalence percentages. These limitations highlighted the need for a more dynamic and informative visualization to enhance user engagement and understanding.
ggplot(data = map_data) +
  geom_sf(aes(fill = Prevalence), color = "white", size = 0.2) +
  scale_fill_gradient(low = "blue", high = "red", na.value = "grey") +
  labs(title = "Global Smoking Prevalence Heatmap",
       fill = "Prevalence (%)") +
  theme_minimal()
My next visualization aimed to be more interactive. This time, I managed to get individual interactive visualizations for each year. However, this did not allow us to visualize the evolution across time in a single plot.
# Create a list of subsets by year
yearly_subcategories <- split(plot_data, plot_data$Year)
# Combine all yearly subcategories back into a single dataset
combined_data_year <- bind_rows(yearly_subcategories)
# Function to create an interactive plot for a specific year
create_interactive_year_plot <- function(data, year) {
  plot <- ggplot(data = data %>% filter(Year == year)) +
    # `text` is not a geom_sf aesthetic; it is passed through for the ggplotly() tooltip
    geom_sf(aes(fill = Prevalence, text = paste("Country:", Entity, "<br>Prevalence:", Prevalence, "%")),
            size = 0.2) + # Removed black borders
    scale_fill_gradient(low = "lightyellow", high = "darkred", na.value = "white", name = "Prevalence (%)") +
    labs(
      title = paste("Global Smoking Prevalence -", year),
      fill = "Prevalence (%)"
    ) +
    theme_minimal() +
    theme(
      plot.title = element_text(hjust = 0.5, size = 16),
      legend.position = "bottom",
      axis.title = element_blank(),
      axis.text = element_blank(),
      axis.ticks = element_blank(),
      panel.grid = element_blank()
    )
  # Convert ggplot to interactive plotly object with hover text
  ggplotly(plot, tooltip = c("text"))
}
# Create interactive visualisations for each year
# Each call emits the warning shown after the first one: ggplot2 does not recognise
# the `text` aesthetic, which is only used by ggplotly() for the tooltip
interactive_plot_2000 <- create_interactive_year_plot(map_data, 2000)
## Warning in layer_sf(geom = GeomSf, data = data, mapping = mapping, stat = stat,
## : Ignoring unknown aesthetics: text
interactive_plot_2005 <- create_interactive_year_plot(map_data, 2005)
interactive_plot_2010 <- create_interactive_year_plot(map_data, 2010)
interactive_plot_2015 <- create_interactive_year_plot(map_data, 2015)
interactive_plot_2020 <- create_interactive_year_plot(map_data, 2020)
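The five calls above differ only in the year, so they could also be generated in a loop. A minimal sketch, in which make_plot() is a hypothetical stand-in for create_interactive_year_plot() so that the example runs without the map data:

```r
# Hypothetical stand-in for create_interactive_year_plot(); returns a label instead of a plot
make_plot <- function(data, year) {
  paste("plot for", year)
}

years <- c(2000, 2005, 2010, 2015, 2020)

# One list element per year, named so each plot can be retrieved by year
interactive_plots <- lapply(years, function(y) make_plot(NULL, y))
names(interactive_plots) <- as.character(years)

interactive_plots[["2010"]]
```

In the project, the body of the anonymous function would be create_interactive_year_plot(map_data, y), and a plot would be displayed with, for example, interactive_plots[["2010"]].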
# Display the interactive plots
interactive_plot_2000
interactive_plot_2005
interactive_plot_2010
interactive_plot_2015